We prove uniform consistency of Random Survival Forests (RSF), a newly introduced
forest ensemble learner for analysis of right-censored survival data. Consistency is
proven under general splitting rules, bootstrapping, and random selection of
variables; that is, under true implementation of the methodology. A key assumption
made is that all variables are factors. Although this assumes that the feature space
has finite cardinality, in practice the space can be extremely large; indeed, current
computational procedures do not properly deal with this setting. An indirect
consequence of this work is the introduction of new computational methodology for
dealing with factors with an unlimited number of labels.


1 Introduction 

Out of the machine learning community have emerged many different types of
learning algorithms. Some of these have excited tremendous interest because of
their performance over benchmark data. With minimal supervision, these algorithms
outperform standard methods in terms of prediction error; in some instances
the difference in prediction error is so substantial there seems no way to bridge the
gap. One of the most exciting algorithms to have been proposed is Random Forests
(RF), an ensemble learning method introduced by Leo Breiman (Breiman, 2001).
RF is an all-purpose algorithm that can be applied in a wide variety of data settings.
In regression settings (i.e. where the response is continuous) the method is referred
to as RF-R. In classification problems, or multiclass problems, where the response
is a class label, the method is referred to as RF-C. Recently the methodology has
also been extended to right-censored survival settings, a method called random
survival forests (RSF) (Ishwaran et al., 2008).


RF is considered an "ensemble learner". Ensemble learners are predictors formed
by aggregating many individual learners (base learners), each of which has been
constructed from a different realization of the data. There has been much interest in
ensemble learners because they have been shown in many instances to outperform the
individual learners they are constructed from, although why this happens is not yet
fully understood. A widely used ensemble technique is bagging (Breiman, 1996a).
In bagging, the ensemble is formed by aggregating a base learner over independent
bootstrap samples of the original data. If the base learner is a classification tree, then
a classification tree is fit to each bootstrap sample, and the ensemble classifier is
defined by taking a majority vote over the individual classifiers. If the base learner is
a regression tree, the ensemble is the averaged tree predictor. Breiman showed that
the improved performance of a bagged predictor is related to instability (Breiman,
1996a,b). If the base learner is unstable, with low bias, then the ensemble will
have better performance. However, bagging a stable learner can sometimes degrade
performance.

Although there are many variants of RF (Amit and Geman, 1997; Dietterich,
2000; Cutler and Zhao, 2001), the most popular, and the one we focus on here, is
that described by Breiman in his software manual (Breiman, 2003). This algorithm
was also discussed in Breiman (2001) under the name Forest-RI, for RF random
input selection, and it is also the algorithm implemented in the R software packages
randomForest (Liaw and Wiener, 2002) and randomSurvivalForest
(Ishwaran and Kogalur, 2007). In this version, RF can be viewed as an extension
of bagging. Using independent bootstrap samples, a random tree is grown by
randomly selecting a subset of variables (features) to be used as candidate variables for
splitting each node. The forest ensemble is constructed by aggregating over the
random trees. The extra randomization introduced in the tree growing process is the
crucial step distinguishing forests from bagging. Unlike bagging, each bootstrap
tree is constructed using different variables, and not all variables are used. This is
designed to encourage independence among trees, and unlike bagging, it not only
reduces variance, but also bias. While this extra step might seem harmless, results
using benchmark data have shown that prediction error for RF can be substantially
better than bagging. In fact, performance of RF has been found comparable to other
state-of-the-art methods such as boosting (Freund and Schapire, 1996) and support
vector machines (Cortes and Vapnik, 1995).

Why RF works is still somewhat of a mystery. Developing theory is difficult
because although each step used in RF is fairly simple, when combined, they
result in an algorithm that is hard to analyze. In his seminal 2001 paper
(Breiman, 2001), Breiman discussed bounds on the generalization error for a
forest as a trade-off involving the number of variables randomly selected as candidates
for splitting, and the correlation between trees. He showed that as the number of
variables increases, strength of a tree (accuracy) increases, but at the price of
increasing correlation among trees, which degrades overall performance. It is unclear,
however, how tight these bounds are; Breiman himself noted they might be loose.
In Lin and Jeon (2006), lower bounds for the mean-squared error for a regression
forest were derived under random splitting by drawing analogies between forests
and nearest neighbor classifiers. Recently, Meinshausen (2006) proved consistency
of RF-R for quantile regression, and Biau, Devroye, and Lugosi (2008) proved
consistency of RF-C under the assumption of random splitting.

In this paper, we prove uniform consistency of RSF. As this is a new extension of
RF to right-censored survival settings, not much is known about its properties. Even
consistency results for survival trees are sparse in the literature. For right-censored
survival data, LeBlanc and Crowley (1993) showed survival tree cumulative hazard
functions are consistent for smoothed cumulative hazard functions. The method of
proof used convergence results for recursive-partitioned regression trees for
uncensored data (Breiman et al., 1984).

We take a different approach and establish consistency by drawing upon count- 
ing process theory. We first prove uniform consistency of survival trees, and from 
this, by making use of bootstrap theory, we prove consistency of RSF (Section 4). 
These results apply to general tree splitting rules (not just random ones) and to true 
implementations of RSF. We make only one important assumption: that the fea- 
ture space is a finite (but very large) discrete space and that all variables are factors 
(Section 3). In this regard we deviate from other proofs of forest consistency. These 
proofs all assume that the feature space is continuous; but this is problematic for 
two reasons. First, it requires strong assumptions about the splitting rule which may 
not be reflected in practice. Secondly, data in most applied statistical problems, es- 
pecially those seen in medical settings, contain a mixture of both continuous and 
categorical data. Because node splitting for categorical variables is fundamentally
different than for continuous variables, theory proven under a continuous feature
space paradigm does not directly apply to most data settings.

On the other hand, to be fair to these approaches, continuous variables are often 
encountered in practice. Thus, it is natural to wonder if an assumption of a discrete 
feature space limits our results. We show in Section 5 that embedding forests in a 
discrete setting is realistic in that one can analyze problems with continuous vari- 
ables by treating them as factors having a large number of factor labels. Indirectly, 
this addresses an unresolved issue related to forests and trees: namely, how to split 
factors with a large number of labels. We introduce new computational methodology
for addressing an unlimited number of labels for factors. For the interested user,
we note that all computations given in the paper can be implemented using the freely
available R software package randomSurvivalForest (Ishwaran and Kogalur,
2007, 2008).



2 Random survival forests algorithm 

We begin with a high-level description of the RSF algorithm. Specific details fol- 
low. 

1. Draw B independent bootstrap samples from the learning data and grow a
binary recursive survival tree to each bootstrap sample.

2. When growing a survival tree, at each node of the tree randomly select p
candidate variables to split on (use as many candidate variables as possible,
up to $p - 1$, if there are fewer than p variables available within the node).
The node is split using the split that maximizes survival difference between
daughter nodes (in the case of ties, a random tie breaking rule is used).

3. Grow a tree as near to saturation as possible (i.e. to full size) with the only
constraint being that each terminal node should have no less than $d_0 > 0$
events.

4. Calculate the tree survival function. The forest ensemble is the averaged tree
survival function.
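
The randomization in Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`bootstrap_sample`, `candidate_variables`, `can_split`) are hypothetical and are not taken from the randomSurvivalForest package, and the split statistic itself is left out.

```python
import random


def bootstrap_sample(data, rng):
    """Step 1: draw n cases with replacement from the learning data."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def candidate_variables(available, p, rng):
    """Step 2: randomly select p candidate variables for a node.

    If fewer than p variables are available within the node,
    all of them are used.
    """
    k = min(p, len(available))
    return rng.sample(list(available), k)


def can_split(left_events, right_events, d0):
    """Step 3: a split is admissible only if each daughter keeps >= d0 events."""
    return left_events >= d0 and right_events >= d0
```

A tree grower would call `candidate_variables` at every node and keep splitting while some candidate split satisfies `can_split`.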

The tree survival function calculated in Step 4 is the Kaplan-Meier (KM)
estimator for the tree's terminal nodes. This can be explained formally using the
following notation. Let $\mathcal{T}$ denote the terminal nodes of a survival tree, $T$. These
are the extreme nodes of $T$ reached when the tree can no longer be split to form
new nodes (daughters). Let $(T_{1,h}, \delta_{1,h}), \ldots, (T_{m(h),h}, \delta_{m(h),h})$ be survival times and
binary {0, 1} censoring variables for cases (individuals) in a terminal node $h \in \mathcal{T}$.
An individual $i$ is said to be right-censored at time $T_{i,h}$ if $\delta_{i,h} = 0$; otherwise, if
$\delta_{i,h} = 1$, the individual is said to have experienced an event at $T_{i,h}$. Let
$t_{1,h} < t_{2,h} < \cdots < t_{M(h),h}$ be the $M(h)$ distinct event times. Define $d_{l,h}$ to be the
number of events at time $t_{l,h}$ and $Y_{l,h}$ to be the number of individuals at risk just prior to
time $t_{l,h}$. The cumulative hazard function (CHF) estimate for $h$ is the Nelson-Aalen
estimator

$$\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}.$$

Note that all cases within $h$ are assigned $\hat{H}_h(t)$.

For later theoretical development it will be helpful to rewrite $\hat{H}_h(t)$ using
counting process notation. Let the predictable function $Y_h^{(n)}(t) = \sum_{i=1}^{m(h)} I(T_{i,h} \ge t)$ be
the number of individuals in $h$ observed to be at risk just prior to $t$, and let $N_h^{(n)}(t)$
be the counting process for $h$ defined as the number of events in $[0, t]$. Define the
indicator process $J_h^{(n)}(t) = I(Y_h^{(n)}(t) > 0)$. Then the Nelson-Aalen estimator for $h$
may equivalently be written as

$$\hat{H}_h(t) = \int_0^t \frac{J_h^{(n)}(s)}{Y_h^{(n)}(s)} \, dN_h^{(n)}(s),$$

where we adopt the convention that $J_h^{(n)}(s) / Y_h^{(n)}(s) = 0$ whenever $Y_h^{(n)}(s) = 0$.
The KM estimator for $h$ is

$$\hat{S}_h(t) = \prod_{s \le t} \left(1 - \Delta\hat{H}_h(s)\right) = \prod_{s \le t} \left(1 - \frac{\Delta N_h^{(n)}(s)}{Y_h^{(n)}(s)}\right). \qquad (1)$$

Each individual $i$ has a $d$-dimensional feature $\mathbf{x}_i$. To determine the survival
function for $i$, drop $\mathbf{x}_i$ down the tree. Because of the binary nature of a survival
tree, $\mathbf{x}_i$ will be assigned a unique terminal node $h \in \mathcal{T}$. The survival function for $i$
is the KM estimator for $\mathbf{x}_i$'s terminal node:

$$\hat{S}(t|\mathbf{x}_i) = \hat{S}_h(t), \quad \text{if } \mathbf{x}_i \in h.$$

Note this defines the survival function for all cases and thus defines the survival
function for the tree, $T$. To make this clear, we write the tree survival function as

$$\hat{S}(t|\mathbf{x}) = \sum_{h \in \mathcal{T}} I(\mathbf{x} \in h)\, \hat{S}_h(t).$$
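
As a concrete illustration of the terminal-node estimators, the following Python sketch computes the Nelson-Aalen and Kaplan-Meier estimates at the distinct event times of a single node. The function name `node_estimators` and the toy data in the usage note are assumptions made for illustration; they are not part of the paper or of any package.

```python
def node_estimators(times, events):
    """Nelson-Aalen and Kaplan-Meier estimates for one terminal node.

    times  : observed times T_{i,h}
    events : censoring indicators delta_{i,h} (1 = event, 0 = right-censored)
    Returns (distinct event times, CHF values, survival values).
    """
    distinct = sorted({t for t, d in zip(times, events) if d == 1})
    chf, surv = [], []
    H, S = 0.0, 1.0
    for t in distinct:
        # d_{l,h}: number of events at this event time
        d = sum(1 for ti, di in zip(times, events) if ti == t and di == 1)
        # Y_{l,h}: number of individuals at risk just prior to this time
        Y = sum(1 for ti in times if ti >= t)
        H += d / Y          # Nelson-Aalen increment
        S *= 1.0 - d / Y    # Kaplan-Meier factor
        chf.append(H)
        surv.append(S)
    return distinct, chf, surv
```

For example, with `times = [2, 3, 3, 5, 7]` and `events = [1, 1, 0, 1, 0]` the distinct event times are 2, 3 and 5, with risk sets of size 5, 4 and 2, so the KM estimate steps through 0.8, 0.6 and 0.3 (up to floating-point rounding).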

3 The feature space 

In establishing consistency of random survival forests we assume that each
coordinate $1 \le j \le d$ of the $d$-dimensional feature $X$ is a factor (discrete nominal
variable) with $1 < L_j < \infty$ distinct labels. While this assumes that the feature space
$\mathscr{X}$ has finite cardinality, the actual size of $\mathscr{X}$ can be quite large, $L_1 \times \cdots \times L_d$, and
moreover, the number of splits that a tree might make from such data can be even
larger, depending on $d$ and $(L_j)_{j=1}^d$.

To see this, note that a split on a factor in a tree results in data points moving
left and right of the parent node such that the complementary pairings define the
new daughter nodes. For example, if a factor has three labels, {A, B, C}, then there
are three complementary pairings (daughters) as follows: {A} and {B, C}; {B}
and {C, A}; and {C} and {A, B}. In general, for a factor with $L_j$ distinct labels,
there are $2^{L_j - 1} - 1$ distinct complementary pairs. Thus, the total number of splits
evaluated when splitting the root node for a survival tree when all variables are
factors can be as much as

$$\text{maximum number of root-node splits} = \sum_{j=1}^{d} 2^{L_j - 1} - d.$$

Following the root-node split are splits on the resulting daughter nodes, and their
daughter nodes, recursively, with each subsequent generation requiring a large
number of evaluations. Each evaluation can result in a new tree, thus showing that
the number of trees (space of trees) associated with $\mathscr{X}$ can be extremely large.
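
The counting argument above can be checked with a short Python sketch. The helper names below are hypothetical; the enumeration fixes the first label in the left daughter so that each unordered pairing is produced exactly once.

```python
from itertools import combinations


def n_complementary_pairs(L):
    """Number of distinct complementary pairs for a factor with L labels."""
    return 2 ** (L - 1) - 1


def complementary_pairs(labels):
    """Enumerate every unordered complementary pair of label subsets once."""
    labels = list(labels)
    first, rest = labels[0], labels[1:]
    pairs = []
    # Keeping `first` on the left avoids listing mirror-image splits twice;
    # k < len(rest) guarantees the right daughter is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = {first, *extra}
            right = set(labels) - left
            pairs.append((left, right))
    return pairs


def max_root_node_splits(label_counts):
    """Sum of 2^(L_j - 1) - 1 over all factors, i.e. sum_j 2^(L_j - 1) - d."""
    return sum(n_complementary_pairs(L) for L in label_counts)
```

For the three-label example, `complementary_pairs("ABC")` returns the three pairings listed in the text, and a single factor with 30 labels already has `n_complementary_pairs(30)` = 536870911 candidate splits.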

4 Properties of survival forests 

A reasonable criterion for consistency of a random survival forest is that the ensemble
survival function converges to the population survival function. We first consider
conditions needed for consistency of a survival tree. Consistency of forests is then
deduced by utilizing bootstrap theory.

4.1 Assumptions 

Let $(X, T, \delta), (X_1, T_1, \delta_1), \ldots, (X_n, T_n, \delta_n)$ be i.i.d. random elements such that $X$,
the feature, takes values in $\mathscr{X}$, a discrete space as described in Section 3. Here
$T = \min(T^o, C)$ is the observed survival time and $\delta = I(T^o \le C)$ is the binary {0, 1}
censoring value, where it is assumed that $T^o$, the true event time, is independent
of $C$, the censoring time. Furthermore, it is assumed that $X$ is independent of $\delta$.
The collection of values $\{(X_i, T_i, \delta_i)\}_{i=1}^n$ is referred to as the learning data and
is used in the construction of the forest (recall the algorithm of Section 2). It is
assumed that $(X, T, \delta)$ has joint distribution $\mathbb{P}$. The marginal distribution for $X$ is
denoted by $\mu$ and defined via $\mu(A) = \mathbb{P}\{X \in A\}$ for all subsets $A$ of $\mathscr{X}$. It is
assumed that $\mu(A) > 0$ for each $A \neq \emptyset$.

The true survival function, or population parameter, is assumed to be of the form

$$S(t|X) := \mathbb{P}\{T^o > t \,|\, X\} = \sum_{x \in \mathscr{X}} I(X = x) \exp\left(-\int_0^t \alpha(s|x)\, ds\right), \qquad (2)$$

where $\alpha(\cdot|x)$ is the non-negative hazard function for the subpopulation $X = x$.

4.2 Uniform consistency of survival trees 

The following result, showing uniform consistency of a survival tree, is a
consequence of the uniform consistency of the KM estimator. Let
$\tau = \min\{\tau(x) : x \in \mathscr{X}\}$, where $\tau(x) = \sup\{t : \int_0^t \alpha(s|x)\, ds < \infty\}$.

Theorem 1. Let $t \in (0, \tau)$. If $\mathbb{P}\{C > 0\} > 0$, and $\alpha(\cdot|x)$ is strictly positive over
$[0, t]$ for at least one $x \in \mathscr{X}$, then

$$\sup_{s \in [0,t]} \left|\mu(\hat{S}(s|X)) - \mu(S(s|X))\right| \xrightarrow{p} 0, \quad \text{as } n \to \infty.$$

(Note that we are using the linear functional notation for expectation in the above
expression.)

4.3 Uniform consistency of survival forests 

A random survival forest is a collection of random survival trees, each grown from
independent bootstrap samples of the learning data, $\mathscr{L} = \{(X_i, T_i, \delta_i)\}_{i=1}^n$. Thus,
in order to prove consistency of RSF, we must extend our previous results to include
bootstrap resampling.

Let $\mathscr{L}^* = \{(X_i^*, T_i^*, \delta_i^*)\}_{i=1}^n$ denote a bootstrap sample of the learning data. Let
$T^*$ be the survival tree grown from $\mathscr{L}^*$ and let $\hat{S}^*(t|x)$ be the KM estimator for $T^*$:

$$\hat{S}^*(t|x) = \sum_{h \in \mathcal{T}^*} I(x \in h)\, \hat{S}_h^*(t),$$

where $\hat{S}_h^*(t)$ is defined similar to (1), and the above sum is over $\mathcal{T}^*$, the set of
terminal nodes of $T^*$.
A random survival forest comprised of $B$ survival trees has an ensemble survival
function

$$\hat{S}_e^*(t|x) = \frac{1}{B} \sum_{b=1}^{B} \hat{S}_b^*(t|x),$$

where $\hat{S}_b^*(t|x)$ is the survival function for the survival tree grown using the $b$-th
bootstrap sample. We show consistency of RSF by establishing consistency of
$\mu(\hat{S}_b^*(t|X))$ for each $b$ (since then consistency of the ensemble, $\mu(\hat{S}_e^*(t|X))$, holds
automatically).
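
A minimal Python sketch of the ensemble average follows. It assumes, purely for illustration, that each tree is available as a callable `S_b(t, x)` returning its survival estimate; the name `ensemble_survival` is hypothetical.

```python
def ensemble_survival(tree_survival_fns, t, x):
    """Average the B tree survival functions at time t for feature x."""
    B = len(tree_survival_fns)
    return sum(S_b(t, x) for S_b in tree_survival_fns) / B
```

Because the ensemble is a plain average over b, a bound on the error of every bootstrapped tree carries over directly to the forest.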

Theorem 2. Let $\tau^* = \min(\tau, \sup(F))$, where $\sup(F)$ is the upper limit of the
support of $F(s) = 1 - \mathbb{P}\{T^o > s\}\mathbb{P}\{C > s\}$. Then under the same conditions as
in Theorem 1, for each $t \in (0, \tau^*)$:

$$\sup_{s \in [0,t]} \left|\mu(\hat{S}^*(s|X)) - \mu(S(s|X))\right| = o_p^*(1) + o_p(1),$$

where $o_p^*$ stands for $o_p$ in bootstrap probability for almost all $\mathscr{L}$-sample sequences;
i.e. with probability one under $\mathbb{P}$.

4.4 Uniform approximation by forests 

Theorem 2 establishes consistency of a bootstrapped survival tree, and from this
consistency of a survival forest follows. While this is a useful line of attack for
establishing large sample properties of forests, it does not convey how in practice a
forest might improve inference over a single tree. Indeed, in finite sample settings,
a forest of trees can have a decided advantage when approximating the true survival
function.

To show this we make use of the following, somewhat idealized, setting.
Suppose that for each $b$ we are allowed to construct a binary survival tree $T_b$ from a
prechosen learning data set $\mathscr{L}_b = \{(X_{b,i}, T_{b,i}, \delta_{b,i})\}_{i=1}^n$ in any manner we choose,
the only constraint being that each terminal node of $T_b$ must contain at least $d_0 = 1$
events. Let $\hat{S}_b(t|x)$ be the KM tree survival function for $T_b$, and let

$$\hat{S}_e(t|x) = \sum_{b=1}^{B} W_b\, \hat{S}_b(t|x) \qquad (3)$$

be the ensemble survival function for the forest comprising $(T_b)_{b=1}^B$, where $W_b > 0$
are forest weights. The next theorem shows that one can always find an ensemble
that uniformly approximates the true survival function (2). Trees do not possess this
property.
Theorem 3. If $n > d$, and $s \in [0, \tau)$, then for each $\varepsilon > 0$ there exists an ensemble
survival function (3) for a survival forest comprised of $B = B(\varepsilon)$ survival trees,
with each tree consisting of $d + 1$ terminal nodes, such that

$$\int_0^s \int_{\mathscr{X}} \left(\hat{S}_e(t|x) - S(t|x)\right)^2 \mu(dx)\, dt < \varepsilon.$$


5 Empirical results 

Our theory has been based on the assumption that all x-variables are factors, but in
practice one often encounters data with continuous variables. Here we show that
one can discretize continuous variables and treat them as factors without unduly
affecting prediction error and inference, thus showing our theory can be extrapolated
to general data settings. At the same time, an indirect consequence of this work is
the introduction of new computational methodology for efficient splitting of factors
and for dealing with factors with an unlimited number of labels.

For illustration we consider the primary biliary cirrhosis (PBC) data of
Fleming and Harrington (1991). The data is from a randomized clinical trial studying
the effectiveness of the drug D-penicillamine on PBC. The dataset involves 312
individuals and contains 17 variables as well as censoring information and time until
death for each individual. Of the 17 features, seven are discrete and 10 are continuous.
Each of the 10 continuous variables was discretized and converted to a factor
with $L$ labels. We investigated different amounts of granularity: $L = 2, \ldots, 30$.

For each level of granularity, L, we fit a survival forest of 1000 survival trees 
using log-rank splitting with node-adaptive random splits. Splits for nodes were 
implemented as follows. A maximum of "nsplit" complementary pairs were cho- 
sen randomly for each of the p randomly selected candidate variables within a 
node (if nsplit exceeded the number of cases in a node, then nsplit was set to the 
size of the node). Log-rank splitting was applied to the randomly selected com- 
plementary pairs and the node was split on that variable and complementary pair 
maximizing the log-rank test. Five different values for nsplit were tried: nsplit =
5, 10, 20, 50, 1024. All computations were implemented using the R software package
randomSurvivalForest (Ishwaran and Kogalur, 2007, 2008).
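
The node-adaptive random splitting described above can be sketched as follows. This is an assumption-laden illustration, not the package's implementation: the function names are hypothetical, the factor is assumed to have at least two labels in the node, repeated pairs are not filtered out, and the log-rank statistic that would rank the candidate pairs is omitted.

```python
import random


def random_complementary_pair(labels, rng):
    """Draw one split: a non-empty proper subset of the labels goes left.

    Assumes at least two labels; otherwise no valid split exists.
    """
    labels = list(labels)
    while True:
        left = {lab for lab in labels if rng.random() < 0.5}
        if 0 < len(left) < len(labels):
            return left, set(labels) - left


def candidate_splits(labels, nsplit, node_size, rng):
    """Draw at most nsplit random pairs; nsplit is capped at the node size."""
    n_draws = min(nsplit, node_size)
    return [random_complementary_pair(labels, rng) for _ in range(n_draws)]
```

Each candidate is generated on its own and can be scored and discarded immediately, so the cost per variable is governed by nsplit rather than by the total number of complementary pairs.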



The top plot in Figure 1 shows out-of-bag prediction error as a function of
granularity and nsplit value. As granularity rises, prediction error increases, but this
increase is reasonably slow and well contained with larger values of nsplit. This
is quite remarkable because the total number of complementary pairs with a
granularity level of $L = 30$ is on the order of $2^{30}$ (over 1 billion pairs) and yet our results show
that using only 50 randomly selected complementary pairs keeps prediction error
in check.

Prediction error measures overall performance, but we should also consider how
inference for a variable is affected by increasing granularity. To study this we
looked at variable importance (VIMP). VIMP measures predictiveness of a variable,
adjusting for all other variables. Positive values of VIMP indicate predictiveness,
and negative and zero values indicate noise (Ishwaran et al., 2008). For each
forest used in Figure 1 we dropped bootstrapped data down the forest and computed
VIMP for each variable. This was repeated 1000 times independently resulting in
a bootstrap distribution for VIMP. The bottom plot of Figure 1 displays the 68%
bootstrap confidence region from this distribution. The analysis was restricted to
only those forests grown under an nsplit value of 1024 but was carried out for each
level of granularity as in Figure 1 (color coding scheme used to depict granularity
is described in the caption of the figure). Overall, one can see that the bootstrap
confidence regions are relatively robust to the level of granularity.

Figure 1: RSF analysis of PBC data using 1000 trees with random log-rank splitting
where variables, both nominal and continuous, were discretized to have a maximum
number of labels (factor granularity). Top figure is out-of-bag prediction error
versus factor granularity, stratified by number of random splits used for a node,
nsplit. Bottom figure shows 68% bootstrap confidence region for variable importance
(VIMP) from 1000 bootstrap samples using an nsplit value of 1024 for each
factor granularity value in the top figure. Color coding is such that the same color
has been used for a variable over the different granularity values (factor granularity
for a variable increases going from top to bottom).




5.1 Remarks 

1. Not only does random splitting maintain good prediction error performance,
but it may actually help mitigate node-splitting bias. It has been noted in
the literature that forests have a tendency to favor continuous variables and
factors with a large number of labels because of a type-I error effect; that
is, with more values to split on, there is an increased probability of finding
a spurious effect. To investigate the extent to which random splitting
can alleviate such bias we expanded the PBC data to include 50 independent
noise variables. Of these, 25 were randomly simulated from a standard normal
distribution, thus representing continuous variables; the other 25 were
discrete variables, randomly simulated from a two-point distribution having
equal probability for each class.

The data was discretized and the bootstrap VIMP distribution for each variable
calculated as in the previous section (in total there were 67 variables). As
in the previous section granularity levels of $L = 2, \ldots, 30$ were investigated.
The analysis was restricted to an nsplit value of 1024.

The 68% bootstrap VIMP confidence regions are depicted in Figure 2 (results
are displayed similar to the bottom plot of Figure 1). Clearly VIMP
distributions for continuous noise variables are wider than those for discrete noise
variables. However, nearly all distributions for continuous noise variables
contain zero, even for high levels of granularity, clearly showing that random
splitting is helping to mitigate selection bias.

2. Variable selection by purposefully adding noise variables has been successfully
applied to certain regression models (Wu, Boos and Stefanski, 2007)
and the results from Figure 2 suggest a similar idea may work for forests.
One implementation would be to introduce a reasonably large number of noise
variables and use the combined bootstrap distribution for the variables to
determine a VIMP threshold (combining the distributions should yield a more
stable threshold value). A nice feature of forests is that the inclusion of even
a fairly large number of noise variables should not impact prediction error or
the VIMP of non-noise variables.

Figure 2: Bootstrapped VIMP from PBC analysis where data has been expanded to
include 25 continuous and 25 discrete binary noise variables. Discrete noise
variables are encoded using a "d", whereas continuous noise variables have codes starting
with "c". VIMP calculated and displayed as in bottom plot of Figure 1.



3. Although our examples considered factors with no more than 30 labels, we
are able to implement splitting on factors with an unlimited number of labels in
randomSurvivalForest. Here is a brief description of this methodology.

The most basic issue is how to represent a split on a factor. Here we emulate
randomForest, the R software package based on Breiman and Cutler and
ported to R by Liaw and Wiener (Liaw and Wiener, 2002). The strategy is to
immutably map the labels in the factor to the bit positions of a 32-bit integer.
A split is then uniquely defined by moving all labels corresponding to bits
ON (equal to one) to the left daughter and moving the rest, corresponding to
bits OFF (equal to zero), to the right daughter. Note that the left daughter and
right daughter define complementary subsets of the factor labels.



For deterministic splits on a factor with up to 32 labels, all possible complementary
pairs of subsets are enumerated. In general there are $2^{L-1} - 1$ such pairs. It is
clear that, with $L = 32$ labels, enumerating the complementary
pairs explicitly becomes memory intensive. In fact, on an architecture in
which an unsigned integer is a 32-bit word, the largest factor representable
in a single word is a factor with 32 labels.



Deterministic splitting requires that we construct the binary representation of
all the possible complementary pairs in a factor, but this is memory intensive.
Our solution is to allow factors with unlimited labels, but to restrict
factors with more than 32 labels to node-adaptive random splitting
(with nsplit set to the cardinality of the working node). We thus completely
avoid the overhead of enumerating the complementary pairs by constructing
and discarding each complementary pair individually.

Once a complementary pair is identified for splitting a tree node, representing 
the split point in its binary complementary pair format is an extension of the 
single word case. We simply use an array of 32-bit unsigned integers that is 
sufficient in length to represent all the labels in the factor. For example, a 
factor with 512 labels requires a vector composed of 16 unsigned integers. 
We refer to this vector as a multi-word complementary pair (MWCP). By 
a simple book-keeping mechanism, we are able to store the vector and its 
length, and are thus able to recover the split point for later use (for example 
for prediction on test data). 
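
The multi-word representation can be illustrated with a small Python sketch. The helper names (`n_words`, `encode_mwcp`, `goes_left`) are hypothetical and labels are assumed to be numbered 0, ..., L - 1; the actual bookkeeping inside randomSurvivalForest may differ.

```python
WORD_BITS = 32  # one unsigned 32-bit word holds 32 labels


def n_words(n_labels):
    """Number of 32-bit words needed to hold one bit per label."""
    return (n_labels + WORD_BITS - 1) // WORD_BITS


def encode_mwcp(left_labels, n_labels):
    """Encode a split as a list of 32-bit words.

    Bit ON means the label goes to the left daughter;
    bit OFF means it goes to the right daughter.
    """
    words = [0] * n_words(n_labels)
    for lab in left_labels:
        words[lab // WORD_BITS] |= 1 << (lab % WORD_BITS)
    return words


def goes_left(words, lab):
    """Recover the daughter assignment of a label from the stored split."""
    return bool((words[lab // WORD_BITS] >> (lab % WORD_BITS)) & 1)
```

With this layout `n_words(512)` is 16, matching the example in the text, and storing the word list together with its length is enough to route a test case by calling `goes_left` on its label.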
6 Proofs

Proof of Theorem 1. Independence of $T^o$ and $C$ ensures that

$$\mathbb{P}\{\delta = 1\} = \mathbb{P}\left(\mathbb{P}\{T^o \le c \,|\, C = c\}\right) \ge \mathbb{P}\{T^o \le c\}\,\mathbb{P}\{C > c\}, \quad \text{for any } c > 0.$$

The assumption $\mathbb{P}\{C > 0\} > 0$ implies that the censoring distribution has mass
bounded away from the origin. Thus, we can find a $c > 0$ such that $\mathbb{P}\{C > c\} > 0$.
The assumed form of the survival function (2) ensures that the distribution
function for $T^o$ is continuous over $[0, t]$. Combining this with the assumption that $\alpha(\cdot|x)$ is
strictly positive for some $x$ implies that $\mathbb{P}\{T^o \le c\} > 0$, and hence, $\mathbb{P}\{\delta = 1\} > 0$.

Recall that a survival tree is grown to full length with the proviso that a terminal
node should have no less than $d_0 > 0$ events. Let $A \subset \mathscr{X}$ be any non-null set.
Then by the law of large numbers, and by the assumed independence of $X$ and $\delta$,

$$\frac{1}{n} \sum_{i=1}^{n} I(X_i \in A, \delta_i = 1) \xrightarrow{a.s.} \mathbb{P}\{X \in A, \delta = 1\} = \mu(A)\,\mathbb{P}\{\delta = 1\} > 0.$$

Therefore,

$$I\left(\sum_{i=1}^{n} I(X_i \in A, \delta_i = 1) \ge d_0\right) \xrightarrow{a.s.} 1. \qquad (4)$$

Thus, the constraint that a terminal node must have $d_0$ events holds almost surely
if the terminal node is $A$. Furthermore, because the survival tree is grown to full
length, this shows that $A$ must be a distinct value $x \in \mathscr{X}$. If it were not, then this
would imply the tree stopped splitting at a node comprised of more than one $x$,
because the requirement of $d_0 > 0$ events could not be met under any split. That is,
any split on this node yields daughters $A_1$ and $A_2$ with at least one daughter having
fewer than $d_0$ deaths. But this contradicts (4), which holds for any $A$. Thus, the tree
almost surely splits on all possible values of $\mathscr{X}$ and has terminal nodes for each
distinct $x \in \mathscr{X}$. In other words,

$$\hat{S}(s|X) = \sum_{x \in \mathscr{X}} I(X = x = h)\, \hat{S}_h(s) + o_p(1).$$

Note that the $o_p(1)$ term is uniform in $s$.

Let $Y^{(n)}(s|x) = \sum_{i=1}^{n} I(T_i \ge s, X_i = x)$ be the number of cases with feature
$x$ who are at risk just prior to $s$. Define $J^{(n)}(s|x) = I(Y^{(n)}(s|x) > 0)$. Now if we
can show that for each $x \in \mathscr{X}$, as $n \to \infty$,

$$\int_0^t \frac{J^{(n)}(s|x)}{Y^{(n)}(s|x)}\, \alpha(s|x)\, ds \xrightarrow{p} 0, \qquad (5)$$

and

$$\int_0^t \left(1 - J^{(n)}(s|x)\right) \alpha(s|x)\, ds \xrightarrow{p} 0, \qquad (6)$$

then by Theorem IV.3.1 (Andersen et al., 1993), for each $h = x$:

$$\sup_{s \in [0,t]} \left|\hat{S}_h(s) - S(s|x)\right| \xrightarrow{p} 0, \quad \text{as } n \to \infty.$$

This would establish the result, because

$$\sup_{s \in [0,t]} \left|\mu(\hat{S}(s|X)) - \mu(S(s|X))\right| \le \sum_{h = x} \mu\{X = x\} \sup_{s \in [0,t]} \left|\hat{S}_h(s) - S(s|x)\right| + o_p(1).$$

Therefore, to complete the proof we need only to verify that conditions (5)
and (6) hold. By the definition of $\tau$, $\sup_{s \in [0,t]} \alpha(s|x) < \infty$ and thus a sufficient
condition for (5) and (6) is that

$$\inf_{s \in [0,t]} Y^{(n)}(s|x) \xrightarrow{p} \infty.$$

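This condition holds by noting that for each $s \in [0, t]$,

$$n^{-1} Y^{(n)}(s|x) \ge \frac{1}{n} \sum_{i=1}^{n} I(T_i > t, \delta_i = 1, X_i = x) \xrightarrow{a.s.} \mu\{X = x\}\, \mathbb{P}\{T^o > t \,|\, x\} > 0.$$

Note that $\mathbb{P}\{T^o > t \,|\, x\} > 0$ because of (2) and the definition of $\tau$. □
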

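Proof of Theorem 2. Let $(M_{n,1}^*, \ldots, M_{n,n}^*)^T$ be a multinomial random vector
from $n$ trials in which each cell has probability $1/n$ (as customary we use a "*" to
indicate that randomness comes from bootstrapping). For each non-null $A \subset \mathscr{X}$,

$$\frac{1}{n} \sum_{i=1}^{n} I(X_i^* \in A, \delta_i^* = 1) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A, \delta_i = 1) + \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A, \delta_i = 1)(M_{n,i}^* - 1)$$
$$= \mathbb{P}\{X \in A, \delta = 1\} + o_p^*(1), \quad \text{a.s.},$$

where the almost sure convergence is in $\mathbb{P}$-probability and the $o_p^*(1)$ term follows
from

$$\mathbb{E}^*\left[\frac{1}{n} \sum_{i=1}^{n} I(X_i \in A, \delta_i = 1)(M_{n,i}^* - 1)\right]^2 \le \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}^*(M_{n,i}^*) = O(n^{-1}).$$
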

In the proof of Theorem 1 it was shown that $\mathbb{P}\{X \in A, \delta = 1\} > 0$. From this, and
using similar arguments as in that proof, it follows that

$$\hat{S}^*(s|X) = \sum_{x \in \mathscr{X}} I(X = x = h)\, \hat{S}_h^*(s) + o_p^*(1),$$

where the $o_p^*(1)$ term is uniform in $s$. Notice that

$$\mu(\hat{S}^*(s|X)) - \mu(S(s|X)) = \left[\mu(\hat{S}^*(s|X)) - \mu(\hat{S}(s|X))\right] + \left[\mu(\hat{S}(s|X)) - \mu(S(s|X))\right].$$

The second term in square brackets is $o_p(1)$ uniformly in $s$ by Theorem 1. To
deal with the first term, we use the representation for $\hat{S}(s|X)$ given in the proof of
Theorem 1 to obtain

$$\mu(\hat{S}^*(s|X)) - \mu(\hat{S}(s|X)) = \sum_{h = x} \mu\{X = x\} \left[\hat{S}_h^*(s) - \hat{S}_h(s)\right] + o_p^*(1) + o_p(1).$$




A bootstrap sample of $\mathscr{L}$ can be drawn equivalently using a two-stage process by
first drawing a multinomial vector $(n_h^*)_h$ from $n$ trials with each cell $h = x$ having
probability $\mu\{X = x = h\}$ and then drawing a bootstrap sample of size $n_h^*$, for
each $h$, from $\mathscr{L}_h = \{(X_i, T_i, \delta_i) : X_i = x = h\}$. It is not hard to show that
$n_h^*/n_h \to 1$, where $n_h = |\mathscr{L}_h| = \sum_{i=1}^{n} I(X_i = x = h) \to \infty$. Thus, to complete
the proof it suffices to show convergence of $\hat{S}_h^*(s) - \hat{S}_h(s)$, for each $h = x$, where
$\hat{S}_h^*(s)$ is derived from a bootstrap sample of size $n_h$ from $\mathscr{L}_h$. We use part (b) of
Lemma 3 from Lo and Singh (1985) for this. This lemma applies if $s < \sup(F)$
under the random censorship model for a continuous survival distribution. All these
conditions are met under our assumptions. Thus applying this lemma, one can
show that

$$\sup_{s \in [0,t]} \left|\hat{S}_h^*(s) - \hat{S}_h(s)\right| = O_p^*\left(n_h^{-1/2} (\log n_h)^{1/2}\right) = o_p^*(1).$$

□

Proof of Theorem 3. It suffices to show that for a given $x$, we can find an ensemble
$\hat{S}_e(\cdot|x')$ that is zero if $x' \neq x$, and for $x' = x$, uniformly approximates $S(\cdot|x)$ over
$[0, s]$ (this suffices because we can always combine such ensembles to uniformly
approximate $S(\cdot|x)$ for all $x$). Choose $\mathscr{L}_b$ such that $\delta_{b,i} = 1$ for all $b$ and $i$. By
repeated splitting on the left, it is clear we can use $d$ splits to construct a tree $T_b$
having $d + 1$ terminal nodes such that the left-most daughter node corresponds to
$x$. Because $n > d$, we can assign at least one event to the left-most node. For
concreteness, assume that the node contains exactly one event, with event time $T > 0$.
Over the remaining $d$ terminal nodes assign event times $T = 0$ for all cases. Thus
$\hat{S}_b(\cdot|x') = 0$ if $x' \neq x$. On the other hand, if $x' = x$, then $\hat{S}_b(t|x)$ is a step function
with value 1 if $t < T$ and value 0 if $t \ge T$. Because $S(\cdot|x)$ is continuous [by
definition (2)], it is uniformly continuous over the compact set $[0, s]$. A uniformly
continuous, monotonically decreasing function over a compact set can be uniformly
approximated by a linear combination of a finite number of step functions such as
$\hat{S}_b(\cdot|x)$. Thus one can construct a finite number of survival trees like $T_b$ that, when
suitably weighted, uniformly approximate $S(\cdot|x)$ over $[0, s]$. □
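
The approximation step at the end of the proof can be illustrated numerically. The sketch below is an assumption-based illustration for one fixed x: the true survival function is taken to be exp(-t), the event times are placed on an equally spaced grid, and the names `step_ensemble`, `ensemble_value` and `max_error` are hypothetical. It shows only that a weighted sum of one-event step functions tracks a continuous decreasing function; it does not build trees.

```python
import math


def step_ensemble(S, s, B):
    """Weights W_b and event times T_b with sum_b W_b * I(t < T_b) close to S on [0, s]."""
    grid = [s * b / B for b in range(B + 1)]
    # Trees 1..B-1 place their single event on the grid; the last tree's
    # event lies beyond s so that the ensemble stays positive on all of [0, s].
    event_times = grid[1:B] + [s + 1.0]
    weights = [S(grid[b - 1]) - S(grid[b]) for b in range(1, B)] + [S(grid[B - 1])]
    return weights, event_times


def ensemble_value(weights, event_times, t):
    """Evaluate the weighted sum of step functions I(t < T_b) at time t."""
    return sum(w for w, T in zip(weights, event_times) if t < T)


def max_error(S, s, B, n_check=2000):
    """Largest absolute error over an evenly spaced check grid on [0, s]."""
    weights, event_times = step_ensemble(S, s, B)
    pts = [s * k / n_check for k in range(n_check + 1)]
    return max(abs(ensemble_value(weights, event_times, t) - S(t)) for t in pts)
```

On each grid cell the ensemble equals the value of S at the left endpoint, so the error is bounded by the largest drop of S across a cell and shrinks as B grows; this mirrors the role of B = B(ε) in Theorem 3.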